ABSTRACT

Aim: The aim of this study was to compare alternative methods for the analysis of zero inflated count data with the
simple count models that researchers frequently use for such data.
Background: Analysis of viral load and risk factors could predict the likelihood of achieving sustained virological response
(SVR). This information is useful to protect a person from acquiring Hepatitis C virus (HCV) infection. The distribution
of viral load contains a large proportion of excess zeros (HCV-RNA under 100), which can lead to over-dispersion.
Patients and methods: The data came from a longitudinal study conducted between 2005 and 2010. The response
variable was the viral load of each HCV patient 6 months after the end of treatment. Poisson regression (PR), negative
binomial regression (NB), zero inflated Poisson regression (ZIP) and zero inflated negative binomial regression (ZINB)
models were fitted to the data. Log likelihood, Akaike Information Criterion (AIC) and Bayesian
Information Criterion (BIC) were used to compare the performance of the models.

Results: According to all criteria, ZINB was the best model for analyzing these data. Age, having risk factors, genotype
3 and protocol of treatment were significant.

Conclusion: Zero inflated negative binomial regression models fit the viral load data better than the Poisson, negative
binomial and zero inflated Poisson models.

Introduction

Hepatitis C virus (HCV) infection is a major
cause of liver disease worldwide and represents a
major public health problem (1-5). Transfusion of,
and contact with, infected blood and its products,
intravenous drug abuse, contamination during
medical procedures and lack of attention to health
precautions are risk factors for HCV (6, 7).
Between 130 and 170 million people are infected
with HCV worldwide, and the global prevalence of
this infection is 2.2%-3% (2, 8, 9). However, this
prevalence varies between countries, and between
developed and undeveloped countries, because
of differences in health policy and medical care (10).
There is no exact estimate of HCV infection in
Iran; estimates rely upon studies that have been
performed on high-risk groups or in a specific
geographic location. Two Iranian studies examined
the prevalence of HCV infection in the general
population and estimated a population prevalence of
less than 1% in Iran (11, 12).

Risk factor evaluation, together with interventions
to reduce the problem in communities, is one
way to protect people from acquiring the
infection. In this paper, the viral load of HCV
patients and the factors that can affect whether
viral load is low or high were examined.

Viral load, like other count data, needs count
models for its analysis (13). The PR model is one
of the most established count models used by
researchers. The important assumption of the PR
model is that the data must not show
over-dispersion, that is, a larger variability than
expected (13). Until recent years, the NB model has
been used to describe such data, assuming that
over-dispersion is only due to unobserved
heterogeneity (14). The distribution of viral load
contains a large proportion of excess zeros
(HCV-RNA under 100), which can lead to
over-dispersion. In this situation, alternative
models may be better at accounting for
over-dispersion due to excess zeros (14).
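
The two warning signs described above, a variance larger than the mean and more zeros than a Poisson distribution would produce, can be checked with a few lines of code. The following is a minimal sketch in Python; the function name `dispersion_summary` and the toy counts are illustrative assumptions and are not the study data.

```python
import math
from statistics import mean, variance

def dispersion_summary(counts):
    """Summarise over-dispersion and excess zeros in a list of counts."""
    n = len(counts)
    m = mean(counts)
    var = variance(counts)  # sample variance (n - 1 denominator)
    observed_zero = sum(1 for y in counts if y == 0) / n
    expected_zero = math.exp(-m)  # Poisson share of zeros for this mean
    return {
        "mean": m,
        "variance": var,
        "ratio": var / m,  # close to 1 for Poisson data
        "observed_zero": observed_zero,
        "expected_zero": expected_zero,
    }

# Hypothetical toy counts, not the viral load data of the study
toy_counts = [0] * 12 + [1, 2, 2, 3, 5, 8, 13, 21]
summary = dispersion_summary(toy_counts)
print(summary)
```

A variance-to-mean ratio well above 1, together with an observed share of zeros far above the Poisson-expected share, suggests that a plain PR model is unlikely to fit well.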

For independent counts with excessive zeros,
Lambert proposed a ZIP regression model (15).
Lambert showed that this model fitted better than
PR or NB models when the data had excessive zeros.
Greene introduced the ZINB model in 1994 and
showed that extra over-dispersion sometimes occurs
in zero inflated data, in which case the ZINB model
had the best fit (16). Although the application of
these models and their comparison with other count
models has increased in medical and health
fields in recent years (14), many researchers in
Iran are unfortunately not familiar with these
models and use ordinary count models such
as PR and NB for analyzing zero inflated count
data. Comparison between these models is needed.
A review of the application and comparison of
such models in health research has also been
reported (17). The aim of this study was twofold:
firstly, to determine the factors of SVR in HCV
patients and secondly, to find the best model for
analyzing these data. Ordinary count models such
as PR and NB, the ZIP model and the ZINB
regression model were used and compared to
identify factors related to SVR in HCV patients.

Patients and Methods

This cross-sectional study was part of a
larger longitudinal study conducted
between 2005 and 2010. All data for this
research were drawn from the medical records of
186 patients with hepatitis C who were referred to
the Tehran hepatitis clinic, a clinic of the
Baqiyatallah Research Center for
Gastroenterology and Liver Diseases, between
2005 and 2010. Patients who completed the
period of treatment (24 weeks or 48 weeks,
depending on the treatment regimen) were
included in this study, and patients
who did not complete their recommended period
of treatment were omitted. The following
information on the 186 patients was extracted
from their medical records: viral load (HCV-RNA)
after treatment; demographic information
(sex and age); genotype (1, 2 or 3);
treatment protocol, which was either combination
therapy of standard interferon (3 MU three times
a week) plus Ribavirin (800-1200 mg per day) for
24 weeks or 48 weeks (18-20) or combination
therapy of peg-interferon (alfa-2a at a fixed
dose of 180 micrograms per week) plus
Ribavirin (800-1200 mg per day) for 24 weeks
or 48 weeks (19, 21); and history of blood
transfusion, addiction (IV drug use) and needle
stick as risk factors.

The five covariates entered in this study were
age, sex, genotype, protocol of treatment and risk
factors. HCV-RNA negative, which we treated as
zero in our analysis, was defined as a value of
less than 100. The process of the study is shown
as a flow diagram in Figure 1.

Figure 1. Diagram showing the process of study

Descriptive statistics such as mean, standard
deviation and percentage were calculated
according to standard methods.
The outcome variable was the viral load of each
HCV patient. Because of SVR, 66.5% of observations
in this study were zeros. The PR model is a
generalized linear model (GLM) for
describing count outcomes or proportions/rates
(13). In count data the variance is sometimes much
larger than the mean, whereas a Poisson
distribution has identical mean and variance.
The phenomenon of the data having greater
variability than expected under a generalized
linear model is called over-dispersion. A common
cause of over-dispersion is heterogeneity among
subjects (13). The NB model is another GLM that
serves as an alternative to the PR model and
accounts for over-dispersion due to unobserved
heterogeneity (14). The NB model may not be
appropriate, however, if the over-dispersion is due
to an excess of zeros in the outcome. In such a
situation, alternative models such as zero inflated
models are recommended (15). If the non-zero
observations do not follow the Poisson
model, then the ZINB model is used, treating the
count process as a negative binomial distribution
(14). The ZINB model makes it possible to
account for over-dispersion due to both
excess zeros and unobserved heterogeneity (14,
22). The models (e.g., PR versus NB and ZIP, NB
versus ZINB, ZIP versus ZINB) were compared
using the Vuong test and the likelihood ratio test.
Log likelihood, Akaike Information Criterion (AIC)
and Bayesian Information Criterion (BIC) were
used to compare the performance of the models.
P-values less than 0.05 were considered
significant. Stata 11 and R were used for the
analysis.
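
To make the difference between the four models concrete, the probability functions behind them can be written out directly. The sketch below is an illustration in Python, not the authors' Stata or R code. It assumes the common mean-dispersion form of the negative binomial (variance equal to mu + alpha * mu^2) and uses constant, hypothetical parameter values with no covariates; all function names are illustrative.

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability of count y with mean lam (lam > 0)."""
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

def negbin_pmf(y, mu, alpha):
    """Negative binomial probability with mean mu and dispersion alpha.

    The variance is mu + alpha * mu**2 (assumed parameterisation).
    """
    r = 1.0 / alpha
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))
    return math.exp(log_p)

def zero_inflated_pmf(y, pi_zero, base_pmf):
    """Mix a point mass at zero (weight pi_zero) with a count distribution."""
    p = (1.0 - pi_zero) * base_pmf(y)
    return pi_zero + p if y == 0 else p

def log_likelihood(counts, pmf):
    """Sum of log probabilities of the observed counts under pmf."""
    return sum(math.log(pmf(y)) for y in counts)

# Hypothetical parameters: 60% structural zeros, count mean 4, alpha 2
def zip_pmf(y):
    return zero_inflated_pmf(y, 0.6, lambda k: poisson_pmf(k, 4.0))

def zinb_pmf(y):
    return zero_inflated_pmf(y, 0.6, lambda k: negbin_pmf(k, 4.0, 2.0))

# Hypothetical toy counts with many zeros and a long right tail
toy_counts = [0, 0, 0, 0, 0, 0, 1, 3, 9, 20]
ll_zip = log_likelihood(toy_counts, zip_pmf)
ll_zinb = log_likelihood(toy_counts, zinb_pmf)
print(ll_zip, ll_zinb)
```

In a regression setting the mean and the zero-inflation probability would each depend on covariates through a link function; the constants here only show how the zero mass and the count part combine.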

Results 

A total of 186 patients were eligible and
entered this study. Of these, 123 (66.5%) had SVR.
According to the score test used for checking zero
inflation, these data showed significant zero
inflation (p < 0.001). The mean age of the patients
was 42.88 years (standard deviation, 11.17), with a
range of 19-76 years. The distribution of covariates
among patients is shown in Table 1. The significant
Pearson chi-square goodness of fit (gof) test (p <
0.001), along with other characteristics of model
fit, indicated that the PR model produced a poor
fit to the data.
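
The text does not state which score test for zero inflation was used. As an illustration only, the sketch below implements the intercept-only form of van den Broek's score test for zero inflation in a Poisson model; treating it as the test applied in this study is an assumption, and the function name and toy counts are hypothetical.

```python
import math

def zero_inflation_score_test(counts):
    """Score test of H0: no zero inflation, for a Poisson model without covariates.

    Returns the chi-square statistic (1 degree of freedom) and its p-value.
    Requires at least one non-zero count.
    """
    n = len(counts)
    ybar = sum(counts) / n
    p0 = math.exp(-ybar)  # Poisson probability of a zero under H0
    n0 = sum(1 for y in counts if y == 0)  # observed number of zeros
    stat = (n0 - n * p0) ** 2 / (n * p0 * (1 - p0) - n * ybar * p0 ** 2)
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical toy counts: 60 zeros and 40 counts of 3
toy_counts = [0] * 60 + [3] * 40
stat, p_value = zero_inflation_score_test(toy_counts)
print(stat, p_value)
```

A large statistic with a small p-value indicates more zeros than the Poisson model can explain, which is the situation reported above.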



Table 1. The distribution of covariates among patients

Variables               Categories        N (%)
Sex                     Man               55 (29.6)
                        Woman             131 (70.4)
Risk factor             Positive          104 (55.9)
                        Negative          84 (44.1)
Genotype                1                 142 (76.3)
                        2                 4 (2.2)
                        3                 40 (21.5)
Protocol of treatment   Inte* + Rib*      100 (53.8)
                        Peg-inte + Rib    86 (46.2)

*Inte: Interferon; Rib: Ribavirin



In the NB model, the estimated dispersion
statistic (α) was 3.51 (95% CI: 3.25, 3.77). A
significant likelihood ratio test (p < 0.001) of
whether the dispersion statistic differed from zero
favored the NB model over the PR model. The Vuong
test was used to compare ZIP and PR. The
significant result (p < 0.001) showed that the ZIP
model was better than PR. In the comparison between
ZIP and NB, however, the Vuong test result was in
favor of the NB model. In the comparisons of ZINB
with PR and of ZINB with NB, the Vuong test showed
that ZINB was the better model (p < 0.001).
According to the significant likelihood ratio test
(p < 0.001), the ZINB model was better than ZIP.
The estimated ZINB dispersion parameter was
different from zero (α = 1.87; 95% CI: 1.39, 2.52).
Comparisons between models are shown in Table 2.
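
The Vuong test used for the non-nested comparisons above works on the per-observation log likelihoods of two fitted models. The following Python sketch shows the basic, uncorrected statistic; whether the software used in this study applied an AIC or BIC correction is not stated, so this is an assumption, and the per-patient log-likelihood values are hypothetical.

```python
import math
from statistics import mean, pstdev

def vuong_statistic(loglik_a, loglik_b):
    """Uncorrected Vuong statistic for two non-nested models.

    loglik_a and loglik_b hold the log likelihood of each observation
    under model A and model B. A large positive value favours model A,
    a large negative value favours model B.
    """
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    n = len(diffs)
    z = math.sqrt(n) * mean(diffs) / pstdev(diffs)
    # Two-sided p-value from the standard normal distribution
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

# Hypothetical per-observation log likelihoods for two fitted models
loglik_model_a = [-1.0, -2.0, -1.5, -0.5, -3.0]
loglik_model_b = [-1.4, -2.5, -1.6, -1.2, -3.3]
z, p = vuong_statistic(loglik_model_a, loglik_model_b)
print(z, p)
```

The per-observation values would normally come from the fitted models themselves, for example by evaluating each model's probability function at every observed count.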



Table 2. Comparison of model fit characteristics.

                  PR         NB         ZIP        ZINB
AIC               575547     21307.6    516065     21196.6
BIC               575586     21346.1    516103     21235.1
Log likelihood    -287767    -10646     -258025    -10591
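
Both information criteria in Table 2 are simple functions of the log likelihood, the number of estimated parameters and, for BIC, the sample size. The Python sketch below shows the standard formulas. The log likelihoods and parameter counts in it are hypothetical and are not the fitted values of this study; only the sample size of 186 is taken from the text.

```python
import math

def aic(log_lik, k):
    """Akaike Information Criterion: 2k - 2 log L."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 log L."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical (log likelihood, number of parameters) for each model
fits = {
    "PR": (-520.0, 7),
    "NB": (-310.0, 8),
    "ZIP": (-450.0, 14),
    "ZINB": (-280.0, 15),
}
n_obs = 186  # number of patients reported in the study

table = {name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in fits.items()}
best_by_aic = min(table, key=lambda name: table[name][0])
best_by_bic = min(table, key=lambda name: table[name][1])
print(table, best_by_aic, best_by_bic)
```

Lower values are better for both criteria. BIC penalises extra parameters more heavily than AIC once the sample size exceeds about 7, so the two criteria can disagree when a richer model improves the log likelihood only slightly.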



The minimum AIC was observed for the
ZINB model, followed by the NB model. The
other indices of model fit (maximum
log likelihood, minimum BIC) also favored ZINB
over all other models, so the ZINB model was the
best model for analyzing these data. Table 3
shows the results of this model. Age, risk
factor, genotype 3 and protocol of treatment had a
significant relation with SVR in the ZINB
model. According to these results,
increasing age (ADJ.OR=0.97; 95% CI 0.94,
0.99; P=0.03) and having a risk factor
(ADJ.OR=0.47; 95% CI 0.24, 0.95; P=0.03)
reduced the chance of SVR, whereas genotype 3
(ADJ.OR=4.48; 95% CI 1.87, 12.82; P=0.001) and
combination therapy of Peg-interferon plus
Ribavirin (ADJ.OR=2.41; 95% CI 1.22, 4.48;
P=0.01) increased the chance of SVR.
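
Adjusted odds ratios such as those above are usually obtained by exponentiating a coefficient from the zero inflated (logit) part of the model, with the confidence limits exponentiated in the same way. The sketch below illustrates that conversion in Python under the assumption of a Wald interval; the coefficient and standard error are hypothetical and are not taken from the fitted model.

```python
import math

def odds_ratio_with_ci(coef, std_err, z=1.96):
    """Convert a logit coefficient and its standard error to an odds ratio
    with an approximate 95% Wald confidence interval."""
    lower = math.exp(coef - z * std_err)
    upper = math.exp(coef + z * std_err)
    return math.exp(coef), lower, upper

# Hypothetical coefficient (log of 2.0) and standard error
odds_ratio, ci_low, ci_high = odds_ratio_with_ci(math.log(2.0), 0.3)
print(round(odds_ratio, 2), round(ci_low, 2), round(ci_high, 2))
```

An interval that excludes 1 corresponds to a coefficient that differs significantly from zero at the 5% level.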

Discussion 

Achieving SVR is very important in the treatment
of HCV, so in this study we examined the
factors related to SVR in HCV patients.
Because the majority of patients
(66.5%) had SVR, our data set was zero inflated.
The common approaches for analyzing count data
such as viral load are Poisson and negative
binomial regression (13, 23); for data with
excessive zeros there are other methods, such as
the zero inflated models used in this
paper and hurdle models (24). Many recent
studies have used these models (14,
25-28). Goetzel et al used Poisson, negative
binomial and zero inflated Poisson models to
quantify the direct medical and indirect (absence
and productivity) cost burden of overweight and
obesity in workers in the U.S. (29). Carrel et al
used a zero inflated negative binomial model to
examine how residence within or outside a flood
protected area interacts with the probability of
cholera presence, and the effect of flood protection
on the magnitude of cholera prevalence (28).
Bergemann and Huang proposed a new method
based on the zero-inflated Poisson (ZIP) regression
likelihood to account simultaneously for missing
genotype data and genotype combinations with
zero counts (26). Dwivedi et al compared zero
inflated models (Poisson and negative binomial)
and hurdle models to test their ability to predict
the number of involved nodes in breast cancer
patients (14). In this paper, PR, NB, ZIP and ZINB
models were used to examine the factors related
to SVR in HCV patients. The improved fit of the
NB model over PR and ZIP clearly indicates that
over-dispersion due to unobserved heterogeneity
and/or clustering is involved. In addition, the
comparison of ZIP with the PR model provided
evidence of over-dispersion due to excess zeros in
the viral load of patients.
Comparing the ZIP and ZINB models by the
likelihood ratio test, the ZINB model was more
appropriate than ZIP. Besides, the AIC, BIC and log
likelihood criteria showed that the ZINB model was
better than the NB regression model, indicating
that the NB model alone may not be appropriate for
describing over-dispersed data with excess zeros.

Younger patients achieved SVR more often than
older patients. Some physiological change
related to increasing age may be the reason for
this result. Patients with genotype 3 achieved SVR more often



Gastroenterol Hepatol Bed Bench 2013;6(1):41-47 



Pourhoseingholi A. et al



Table 3. Zero inflated negative binomial model for viral load data

                                                          Negative binomial part          Zero inflated part
Variable                                                  Adj. RR* (95% CI)    P value    Adj. OR** (95% CI)    P value
Female (reference: male)                                  0.98 (0.95, 1.01)    0.38       0.49 (0.24, 1.01)     0.05
Age                                                       1.15 (0.55, 2.40)    0.7        0.97 (0.94, 0.99)     0.03
Risk factor (reference: negative)                         1.04 (0.45, 2.38)    0.92       0.47 (0.24, 0.95)     0.03
Genotype 2 (reference: 1)                                 0.001 (0.00, 1.01)   0.25       2.43 (0.22, 10.80)    0.48
Genotype 3 (reference: 1)                                 0.50 (0.16, 1.60)    0.24       4.48 (1.87, 12.82)    0.001
Protocol of treatment (reference: interferon+ribavirin)   0.81 (0.36, 1.83)    0.62       2.41 (1.22, 4.48)     0.01

* Adjusted Relative Risk

** Adjusted Odds Ratio



than patients with genotype 1. These results suggest
that achieving SVR in genotype 1 is more difficult
than in other genotypes, and this has been
confirmed in other studies (30, 31). Certain patient
risk factors decrease the chance of SVR. For
example, genotype 1 is associated with
patient risk factors such as illegal drug use,
infection by transfusion and contact with infected
blood and its products (32). Two main treatment
protocols were used in this study according to the
genotype of the patient. According to the results,
combination therapy of peg-interferon plus Ribavirin
had better results than combination therapy of
standard interferon plus Ribavirin. Many studies
conducted so far have shown that peg-interferon
plus Ribavirin has the highest likelihood of an SVR
response to treatment (31, 33-37), especially for
genotype 1. Genotype 1 responds poorly to
treatment, and this difficulty is recognized in the
choice of treatment protocol (36, 37).
Unfortunately, in Iran, because of the cost of
these expensive drugs, this combination is not the
first choice of doctors. Usually, only after a
patient does not respond to initial treatment with
monotherapy do doctors decide to choose
peg-interferon plus Ribavirin (37).

In conclusion, we have shown that the ZINB
regression model is the best model for analyzing
and describing the viral load distribution. This
confirms that the distribution of the viral load
contained over-dispersion due not only to
unobserved heterogeneity but also to excessive
negative HCV-RNA results (zeros). As expected, the
PR model was the worst model for analyzing HCV-RNA.
Accounting for even one source of over-dispersion,
whether excessive zeros or unobserved
heterogeneity, improved the gof of the models, as
indicated by the ZINB, NB and ZIP models. To
analyze count data with zeros, it is essential to
check the assumptions of the different count models
and then to use the appropriate count model in
order to obtain meaningful results.